1. INTRODUCTION

Over the years, the adoption of battery electric vehicles (BEVs) has been growing, but a major
hindrance to their promotion and usage is the inaccurate display of residual power. This problem
contributes to range anxiety among drivers, caused by uncertainties in battery performance and other factors.
The goal of this study is to tackle this problem by creating a model that can precisely predict the driving
range of BEVs. This study applies advanced machine learning (ML) techniques to estimate the driving range
of electric vehicles (EVs) accurately by considering both internal and external factors. These factors include
the use of heating, average speed, air conditioning, energy consumption, and route type. With better battery
technology and the demand for minimal or zero-emission vehicles, EVs are a strong contender to take the
place of vehicles powered by combustion engines. Despite these advantages, EVs have not yet gained much
popularity with the general public. Because of the limited charging infrastructure and the consequently
shorter driving range, BEV drivers may experience range anxiety, the worry that the battery will deplete
before they reach their destination [1]. To minimize range anxiety and increase the usability of EVs,
applications are needed that help drivers reach their destinations safely without wasting a great deal of time
or money. The primary goal of these applications is to predict the driving range accurately. Drivers often
hold back as much as twenty percent of the charge in their batteries as a precautionary measure [2], which
reduces how efficiently the battery's capacity is used.

Research on increasing battery capacity or driving range for EVs builds on the driving habits of EV
users. To utilize battery capacity optimally, Li et al. [3] presented a mixed distribution model that
described daily trip miles. The test results demonstrated that the mixed distribution model could meet the
demands of various drivers. Furthermore, Dong and Lin [4] created the concept of BEV viability by
employing a stochastic modelling approach to characterize the behaviours of BEV drivers. To find ways to
lessen range anxiety, they examined the comfort levels of drivers with various driving traits. However, the
researchers discovered that the factors are linked even though the driving behaviour that characterizes BEVs
is stochastic. Brady and O’Mahony [5] used a stochastic modelling approach after studying the dependency
structure between six variables using a nonparametric copula function. The result was a daily trip itinerary
and charging profile.

Deploying EVs is one of the most thorough approaches to reducing air pollution. Governments are thus
promoting the purchase and usage of these vehicles in place of cars with internal combustion engines [6].
EV sales reportedly increased 72% globally in 2018 compared with 2017, and they saw a 2.1% rise in market
share [7]. The small market share of electric vehicles may seem odd given the benefits listed above and the
presence of large companies in the sector, but it is due to a number of factors, the most prominent of which
are their high purchase costs, long charging times compared with refuelling cars powered by fossil fuels, and
their limited range per charge [8]. For data-driven predictions, like those generated by ML systems, a large
training dataset is preferred [9]. A few papers have suggested data sharing between cars and the cloud so that
users can benefit from the experience of other drivers, ultimately producing more accurate forecasts.
By gathering data on BEVs' energy usage while navigating a road stretch, Grubwinkler et al. [10]
presented an energetic route map built through crowdsourcing. To collect data from the general public for
forecasting vehicle energy consumption, Tseng and Chau [11] used the participatory sensing approach.
Straub et al. [12] proposed an alternative approach to developing an energetic roadmap by collecting driving
profiles from the crowd and using machine learning techniques to fill in gaps in the information. This method
compensated for limitations in data coverage, resulting in a more accurate and reliable energy roadmap.

In recent years, data-driven methods have become more widely used as an effective way of
estimating consumption and gauging driving range. Compared with more traditional approaches, they are
more reliable and cost-effective, because Internet of Things (IoT) innovations have reduced the costs
associated with deployment. To reduce the expenses associated with installing sensors and
transferring data from cars, a considerable amount of information is extracted from the vehicle's network and
transmitted to the cloud. This data may then be processed by machine learning algorithms to offer a variety
of helpful services [13]. One of the main problems with ML is the uneven distribution of the training dataset.
In general, a machine learning model's ability to predict outcomes accurately on testing data suffers if the
distributions of the training and testing sets differ.


2. REGRESSION MODELS 

Regression models are widely used in data analysis and ML to predict a continuous target variable
from a number of input variables. This work investigates several popular regression models for estimating
the driving range of electric cars from a variety of input features. The machine learning algorithms used to
model the driving range of EVs [14], [15] are linear regression, random forest, and deep multi-layer
perceptron (MLP). The last two are nonlinear techniques. Linear regression is an ML method that models the
outcome as a linear function of the independent variables. The fitted line is a straight line that closely
approximates the individual data points [16]. The aim of the algorithm is to minimize the discrepancy
between the actual values provided by the manufacturer and the predicted values, and the model is given
by (1).


$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon \qquad (1)$$

Here, the dependent variable $Y$ stands for the EV’s driving range. $X_1, X_2, \ldots, X_n$ are the
independent variables that affect driving range. $\beta_0$ is the intercept, and $\beta_1, \beta_2, \ldots,
\beta_n$ are the coefficients of the independent variables. $\varepsilon$ is the disturbance term or error
variable in the data. The coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_n$ are computed to minimize
the sum of squared deviations between the actual and predicted values [17], [18]. MLP is a neural network
made up of several connected layers that transform the input dimension into the desired output dimension.
Neurons (or nodes) are connected so that the outputs of one layer are fed as inputs to the next. There is one
input layer and one output layer, and there may be any number of hidden layers, each with any number of
nodes.
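
As a minimal sketch of the least-squares fit behind (1), the coefficients can be obtained by solving the normal equations. The function names `fit_ols` and `predict` and the toy data are hypothetical and chosen only for illustration; the text does not state how the fit was implemented in the study.

```python
def fit_ols(X, y):
    """Return [b0, b1, ..., bn] that minimize the sum of squared residuals."""
    # Prepend a constant 1.0 to every row so that b0 acts as the intercept.
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    # Build the normal equations (A^T A) b = A^T y as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in A) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(A, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict(beta, row):
    """Evaluate b0 + b1*x1 + ... + bn*xn for one feature row."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Hypothetical toy data: (average speed, consumption) -> driving range.
X = [[40.0, 14.0], [50.0, 15.0], [60.0, 17.0], [70.0, 16.0], [80.0, 20.0]]
y = [200.0 - 0.5 * s - 3.0 * c for s, c in X]  # exact linear relation, no noise

beta = fit_ols(X, y)
print([round(b, 6) for b in beta])
print(round(predict(beta, [55.0, 16.0]), 6))
```

Because the toy target is an exact linear function of the inputs, the fit recovers the generating coefficients up to floating-point error. With real, noisy records the residual term in (1) would be non-zero.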

Deep MLP can capture complex correlations between features such as power, trip distance, energy
consumption, and driving style, leading to more accurate range estimates. With its deep architecture and
non-linear activation functions, the method can uncover unusual patterns and correlations in the data that
are difficult to capture with linear models such as linear regression. We used four hidden layers of 64
neurons each, with the rectified linear unit (ReLU) activation function [19]. The benefit of ReLU
over other activation functions, such as the sigmoid or hyperbolic tangent, is that it enables the network to
learn more rapidly and avoids the saturation issue. The ReLU function is defined as max(0.0, x), where x
is the input to the activation function. It returns the input value if it is positive, otherwise zero. The
network was trained on mini-batches using the Adam optimization algorithm with a batch size of 32. The
batch size is the number of samples used in each iteration. The next algorithm used in this study to compare
driving-range predictions is random forest (RF) [20], [21]. Random forest regression is a supervised learning
approach that uses ensemble learning, which combines the predictions of multiple models (here, decision
trees) to improve accuracy on regression problems. To build the random forest regressor, we imported the
RandomForestRegressor class from the sklearn package, created an instance of it, and assigned it to a
variable. We set the n_estimators argument to 50, which means that the random forest consists of 50 trees.
Calling the fit() method trains the model by fitting each tree to the training data. Once training is complete,
the model is ready to generate predictions based on the patterns learned from the training data.
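
The two non-linear models described above could be set up with scikit-learn roughly as follows. The random forest part follows the text (50 trees, then fit()). Using scikit-learn's `MLPRegressor` for the deep MLP is an assumption, since the text does not name the framework that was used. The synthetic data, `max_iter`, and `random_state` values are likewise illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor


def relu(x):
    """ReLU as defined in the text: the input if positive, otherwise zero."""
    return max(0.0, x)


# Synthetic stand-in for the real features and target (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 100.0 + 50.0 * X[:, 0] - 30.0 * X[:, 1] + rng.normal(0.0, 1.0, size=200)

# Random forest with 50 trees, as stated in the text.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X, y)

# Four hidden layers of 64 neurons, ReLU activation, Adam, batch size 32.
mlp = MLPRegressor(hidden_layer_sizes=(64, 64, 64, 64), activation="relu",
                   solver="adam", batch_size=32, max_iter=200, random_state=0)
mlp.fit(X, y)

rf_pred = rf.predict(X[:5])
mlp_pred = mlp.predict(X[:5])
print(rf_pred.shape, mlp_pred.shape)
```

With only 200 iterations the MLP may stop before full convergence and emit a warning; a real experiment would tune the number of epochs and scale the inputs first.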


3. DATASET DESCRIPTION AND PREPROCESSING 

We conducted our research using a publicly available dataset from SpritMonitor, which lists the
electric vehicles that have received the most fueling records from various consumers, representing various
cars [14]. Owing to its broad collection of vehicle fueling data, SpritMonitor emerged as the most
appropriate option among the datasets we considered. On the crowdsourcing website
SpritMonitor, users can provide details on the makes, models, features, and fuel use of their vehicles. Each
record includes details such as the distance traveled since the last fill-up, the amount of petrol used, the kind
of tire and petrol, and other relevant statistics. The dataset, which is a useful resource for our investigation,
mostly includes information from well-known and commonly used car types. As in earlier studies [22], it
is essential to train a distinct machine learning model for each type of electric car owing to the substantial
differences among them. We chose the Volkswagen e-Golf for our subsequent investigation.
This decision was based on the records-to-users ratio, since a larger ratio produces a dataset that is more
evenly distributed.


3.1. Data collection and variables 

The dataset for this research was gathered using crawlers, which took data from numerous sources 
on electric automobiles. Since we are using supervised learning in this case, the model output should be as 
near to the target output as possible [23]. The dataset is stored in a CSV file and contains a range of variables 
affecting EV driving range. 


3.2. Data preprocessing techniques 

Because of abnormal data points caused by device failures, the raw dataset cannot be used
directly. Before exploratory data analysis (EDA), the dataset has to be preprocessed to ensure data
consistency and quality. The raw data was therefore processed by handling missing values, removing
irrelevant columns, and labeling the target variable [24]. Since the numerical variable ‘avg speed(km/h)’
had missing values, we filled them using mean or median imputation, depending on whether the data
distribution is symmetric or skewed, as judged from the box plot in Figure 1. We chose the mean to impute
missing values because the data is symmetric within the range of 40-60 km/h. For outlier data points we
used median imputation [25].
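
The imputation step can be illustrated with a short sketch. The helper name `impute` and the sample readings are hypothetical; only the mean and median strategies themselves come from the text.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical 'avg speed(km/h)' readings; None marks a missing value.
avg_speed = [42.0, 55.0, None, 48.0, 60.0, None, 45.0]

filled_mean = impute(avg_speed, "mean")      # suited to symmetric data
filled_median = impute(avg_speed, "median")  # more robust to outliers
print(filled_mean)
print(filled_median)
```

The mean is pulled toward extreme readings while the median is not, which is why the median is the safer fill value when a column contains outliers.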


3.3. Exploratory data analysis (EDA) approaches 

In order to learn more about the dataset, comprehend the connections between variables, and spot 
anomalies or patterns that can have an impact on the range prediction models, EDA approaches are used. The 
EDA approaches used in this study include: 


3.3.1. Univariate 
During the univariate analysis, we evaluated individual variables in the dataset to determine their
distributions and features. For example, we computed descriptive statistics (mean, median, and standard
deviation) for variables such as average speed, quantity, and energy consumption rate [26]. We also
visualized the distributions using histograms, box plots, and density plots to identify any outliers or skewness
in the data, as shown in Figure 1.

Figure 1. Density and box plot of consumption (kWh/100 km)
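
A univariate summary of this kind might be computed as sketched below. The 1.5 x IQR whisker rule is the usual box-plot convention and is assumed here, since the text does not state which outlier rule was applied; the function name `describe` and the sample values are hypothetical.

```python
import statistics


def describe(values):
    """Descriptive statistics plus outliers flagged by the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical consumption readings (kWh/100 km) with one extreme value.
consumption = [14.0, 15.0, 15.5, 16.0, 16.5, 17.0, 18.0, 45.0]
summary = describe(consumption)
print(summary)
```

In this sample the mean lies above the median, which is the kind of right skew a density plot or box plot would make visible.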


3.3.2. Bivariate 


In the bivariate analysis, we concentrated on the interactions between pairs of variables to
find linkages and dependencies. For instance, we used scatter plots of trip distance against other features to
examine whether certain parameters affect the range of electric cars. Above 20 kWh, the quantity is directly
proportional to the trip distance. As Figure 2 shows, a quantity in the 0 to 20 kWh range may not determine
the trip distance by itself, so we incorporated other features as well. For EVs with an energy consumption
of 10 to 20 kWh/100 km, the traveled distance is greater. When the air conditioner is running and the park
heating is not turned on, the energy usage is greater.

Figure 2. Scatter plot of car battery info vs driving range
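
One simple way to examine how a pair of settings relates to energy usage is to compare group means, as sketched below. The records are invented for illustration and do not reproduce values from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (A/C on?, park heating on?, consumption in kWh/100 km).
records = [
    (True, False, 17.5), (True, False, 18.5),
    (False, False, 15.0), (False, False, 16.0),
    (False, True, 16.5), (True, True, 17.0),
]

# Collect the consumption values for every (A/C, park heating) combination.
groups = defaultdict(list)
for ac, park_heating, consumption in records:
    groups[(ac, park_heating)].append(consumption)

mean_by_group = {key: mean(vals) for key, vals in groups.items()}
for key, value in sorted(mean_by_group.items()):
    print(key, value)
```

Group means complement the scatter plots: they give one number per setting combination that can be compared directly.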


3.3.3. Multivariate 


We studied interactions between three or more variables in the multivariate analysis to comprehend
intricate patterns and relationships [27], [28]. For instance, we used heat maps to visualize the correlation
matrix between trip distance and the other variables, including auxiliary loads (Figure 3). As a result,
we were able to pinpoint the factors that were strongly correlated and may significantly affect the EV
driving range [29], [30]. Energy consumption shows very few outliers, with values between 50 and 100,
and the outliers in quantity lie at or above 40. In average speed, a few outliers fall below 10 km/h or above
80 km/h. Figure 4 depicts the flow of the EV range prediction procedure. By applying these data
pre-processing approaches and conducting thorough exploratory data analysis, we generated a clean and
relevant dataset for further research and model building.

Exploratory data analysis for electric vehicle driving range prediction
... (Debani Prasad Mishra)

478 0 ISSN: 2252-8792

Figure 3. Correlation matrix

Figure 4. Proposed processing structure for predicting energy consumption in EVs
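
The values behind a correlation heat map such as Figure 3 are pairwise Pearson coefficients. The sketch below computes them in plain Python; the column names and numbers are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every column pair."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Hypothetical columns standing in for the real features.
data = {
    "trip_distance": [50.0, 80.0, 120.0, 150.0, 200.0],
    "quantity": [8.0, 12.0, 19.0, 24.0, 31.0],
    "avg_speed": [45.0, 60.0, 40.0, 55.0, 50.0],
}
corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Coefficients close to 1 or -1 mark strongly related variable pairs, which is how strongly connected factors can be read off the heat map.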


4. RESULTS AND DISCUSSION

To evaluate the performance of different regression models for EV range prediction, we compared
linear regression, random forest (RF), and deep multilayer perceptron (Deep MLP) algorithms. The efficacy
of the models is evaluated using regression metrics [31]. These metrics compute the prediction error, or the
difference between actual and predicted values. Since our focus is on minimizing large outlier errors,
mean squared error (MSE) is preferable to mean absolute error (MAE). The MSE is the average of the
squared differences between the predicted output and the actual output. Squaring removes the sign of each
error, so overestimates and underestimates are treated alike, and it penalizes large errors more heavily than
small ones. The MSE is given by (2).


$$\mathrm{MSE}(y_{true}, y_{pred}) = \frac{1}{n_{samples}} \sum (y_{true} - y_{pred})^2 \qquad (2)$$

Here $y_{true}$ denotes the true value of the target variable and $y_{pred}$ is the predicted value or the
output. The lower the MSE, the closer the predicted values are to the actual results. The $R^2$ score is the
next evaluation criterion; it measures how much of the target variable's variance can be accounted for by the
model's features [32]. It indicates how well the model explains the variability of the outcome variable and is
formulated as (3) and (4).


$$R^2(y_{true}, y_{pred}) = 1 - \frac{\sum (y_{true} - y_{pred})^2}{\sum (y_{true} - \bar{y})^2} \qquad (3)$$

$$\bar{y} = \frac{1}{n_{samples}} \sum y_{true} \qquad (4)$$
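
The metrics in (2) to (4) can be written out directly, as in the following sketch. The function names and sample values are illustrative only. The second set of predictions has the same MAE as the first but a single large miss, which shows why MSE is the stricter measure when large errors matter.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error, as in (2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error, for comparison with MSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination, as in (3) and (4)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical driving ranges in km.
y_true = [100.0, 150.0, 200.0, 250.0]
y_even = [110.0, 140.0, 210.0, 240.0]  # four errors of 10 km each
y_miss = [100.0, 150.0, 200.0, 290.0]  # one error of 40 km

print(mae(y_true, y_even), mse(y_true, y_even), sqrt(mse(y_true, y_even)))
print(mae(y_true, y_miss), mse(y_true, y_miss))
print(r2_score(y_true, y_even))
```

The root mean squared error (RMSE) reported alongside the MSE is its square root, which expresses the error in the same unit as the target.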

Performance improves as the $R^2$ score approaches 1. Table 1 displays the results of our
comparative analysis of the regression models for EV range prediction. The graphs compare the actual and
predicted values of the EV range visually. Figure 5 displays a scatter plot of the actual versus predicted
values using linear regression. Figure 6 is a line plot of the actual versus predicted values using random
forest regression. Lastly, Figure 7 presents a scatter plot of the actual versus predicted values
obtained from Deep MLP. By analyzing the distribution and patterns of the data points around the reference
line, one can gain a better understanding of the performance and reliability of the regression models in
capturing the true EV range.


Table 1. Performance evaluation of the proposed model

Regression model     Root mean squared error (RMSE)   Mean squared error (MSE)   R-squared
Linear regression    17.7274                          314.2621                   0.8451
Random forest        13.6498                          186.3196                   0.9082
Deep MLP             11.8738                          140.9893                   0.9305

Figure 5. Linear regression’s actual vs predicted

Figure 6. Random forest’s actual vs predicted

Figure 7. Actual vs predicted scatter plot by deep MLP


All of the models scored well, with R-squared values ranging from 0.85 to 0.93 (Table 1). The deep
MLP model obtained the highest score, with an R-squared of 0.93 and the lowest RMSE of 11.87.
The random forest model also performed well, with an R-squared of 0.91. The linear regression model, the
simplest of the three, had the lowest R-squared, 0.85, and the highest RMSE, 17.73.


5. CONCLUSION 

This paper presents a comparative analysis of machine learning algorithms for predicting the driving
range of EVs. To this end, we examined a real-world dataset that included various factors affecting the EV
range. To enhance the quality of our data and facilitate model training, we incorporated exploratory data
analysis techniques during the data pre-processing phase. These methods allowed us to prepare the data and
develop a thorough grasp of it. We then implemented and assessed several regression models: linear
regression, random forest (RF), and deep multilayer perceptron (MLP). The main goal of this work was to
find the best machine learning strategy for accurately forecasting the range of EVs. Our findings indicated
that the deep MLP and random forest models outperformed the traditional linear regression algorithm, with
higher $R^2$ scores and lower MSE and RMSE values, and that deep MLP provided the best predictive
performance. Future research could focus on incorporating additional variables, such as battery health,
charging infrastructure, traffic patterns, road slope, and driver behaviour, to further enhance the accuracy of
EV range prediction models. Furthermore, XGBoost and LightGBM methods provide distinct opportunities
for researchers and practitioners to develop precise, efficient, and trustworthy data-driven approaches for
EV energy consumption studies.